In this lab, we we’ll take our first look at neutrality statistics, which are statistical tests designed to measure if whether SNPs appear to be undergoing selection within a population. We’ll go over what each of the statistics are, what they mean, and how to interpret them, and then we’ll use them on our own data (with the exception of our final statistic, iHS, which is for whole sequence data, not variant calls).
First, let’s get an understanding of what each of these statistics might mean when calculated for our populations…
This is a fairly common neutrality statistic. For this lab, we’ve read the original paper by Tajima, which details the development of his D statistic as a way of testing for the neutral mutation hypothesis (or Kimura’s theory of neutral mutation, as we’ve discussed it in class). Tajima’s D tests whether the pattern of mutations found in a particular region of the genome in a population is following the assumptions of neutrality models; in other words, if there has been selection for or against a particular allele in the population, the pattern seen will violate the assumptions of the neutral model.
With Tajima’s D, the closer to 0 the D statistic is, the closer the locus being tested is to meeting the assumptions of neutrality. In other words, when D is close to 0 there are no alleles over- or under-represented at that locus across the population, suggesting that selection is not acting on that locus in the population. If the D statistic is a negative number, this means the population is experiencing purifying or negative selection against an allele at that locus. Conversely, if the D statistic is positive, this means that the population is experiencing positive selection at the locus.
Today, we’ll calculate Tajima’s D values for our populations in order to determine if any directional selection is happening at the UCP1 locus.
Fu and Li’s D and F are also fairly straightforward tests derived from the same statistics as Tajima’s D. Like Tajima’s D, these statistics are also based upon Kimura’s neutrality statistic, again meaning that anything deviating from 0 is a violation of neutrality. What these statistics do, specifically, is measure the number of singleton or private mutations (i.e., the number of people within a sample that have a novel mutation specific to themselves). While the D statistic is based on the difference between the number of singletons in the population and the total number of mutations in population, the F statistic is based on the difference between the number of singletons and the average number of nucleotide differences between pairs of sequences. This difference in how they’re derived makes it possible for one statistic to be positive while the other is negative in the same population.
We can interpret these statistics like so:
If the statistics are strongly negative, this shows an excess of singletons in the population. This means there are quite a few people whose genomic variants don’t match each other. Such a situation could be from rapid population growth (meaning everyone is closely related and so there’s been no time for mutation to catch up to the increased numbers) or a selective sweep (see below for a formal definition) further back in the population’s history, meaning that strong directional selection for a certain mutation led to an overrepresentation of that mutation (at the expense of other variants at that SNP).
If the statistics are positive, this indicates an excess of old/ancestral variants, that have been selected for in the past. In other words, there are very few unique variants, and the variants that are present are carried by a lot of individuals.
We will calculate Fu and Li’s D and F statistics for our data today.
Recent positive selection usually happens by selective sweeps (i.e. the rapid spread of a particular haplotype, driven by selection for particular alleles within that haplotype), which not only act upon the SNP that corresponds to the phenotype under selection, but the entire haplotype or linkage region in which that SNP is found. When this happens, the alleles in these selected haplotypes become almost entirely homozygous, because selection on this region of the genome is so strong. An iHS score, or an Integrated Haplotype Score, is a test that can be used to measure the amount of recent positive selection experienced by an allele indirectly by looking for evidence of selective sweeps. This test identifies extended haplotypes (large sets of SNPs on a single region of the chromosome) and looking at how many of the SNPs are homozygous in each haplotype. Mass homozygosity in a particular haplotype is recorded as a high iHS score, while low levels of homozygosity is recorded as a low iHS score.
Learn what Fu and Li’s D and F, Tajima’s D, and iHS scores mean, and how to interpret them.
Learn to use the PopGenome package to calculate neutrality statistics such as Fu and Li’s D and F, and Tajima’s D.
Learn to use the pegas package to explore our Tajima’s D statistic further.
Learn about iHS by working through an example of how to calculate iHS in R, and reflect on what the iHS score for our population’s UCP1 haplotype might look like.
Learn about whether or not selection is happening in our populations based on these statistics, and relate it to environments that may cause these alleles in UCP1 to be selected for or against.
Log in to the SCC and bring up the R Studio window as usual:
module load gcc
module load R
rstudio
Next, we need to install the package PopGenome in our R Studio space, because the SCC doesn’t have it pre-installed.
#We need to install this package
install.packages("PopGenome")
Next, load in the packages we need to use:
library(vcfR)
library(PopGenome)
library(pegas)
Finally, we need to format our data for analysis. PopGenome is a bit funky about how they look for the files in your directory, so we need to move some files around in our folders to make the GENOME object that the PopGenome package will work with. Specifically, if you haven’t already done so, we need to make a new directory within our working directory, and put a copy of our data into that folder. Do this in your SCC space using the mkdir and cp commands (remember to use your population name rather than YRI!):
mkdir YRI
cp YRI.vcf /YRI/YRI.vcf
Now, we can make our GENOME object by directing the function to the folder we just created. Additionally, we’ll make a DNAbin object (as we have done before) to save for later…
Here we’ll make our GENOME object:
YRIgenome <- readData("YRI", format = "VCF")
## | : | : | 100 %
## |====================================================
YRIgenome
## -----
## Modules:
## -----
## Calculation Description Get.the.Result
## 1 readData Reading data get.sum.data
## 2 neutrality.stats Neutrality tests get.neutrality
## 3 linkage.stats Linkage disequilibrium get.linkage
## 4 recomb.stats Recombination get.recomb
## 5 F_ST.stats Fixation index get.F_ST,get.diversity
## 6 diversity.stats Diversities get.diversity
## 7 sweeps.stats Selective sweeps get.sweeps
## 8 MKT McDonald-Kreitman test get.MKT
## 9 detail.stats Mixed statistics get.detail
## 10 MS Coalescent simulation @
## 11 -------------- ----------- -------------
## 12 set.populations Defines the populations
## 13 sliding.window.transform Sliding window
## 14 splitting.data Splits the data
## 15 show.slots ?provided slots?
## 16 get.status Status of calculations
And here we’ll make our DNAbin object:
YRI <- read.vcfR("YRI.vcf")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 130
## header_line: 131
## variant count: 207
## column count: 117
##
Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 207
## Character matrix gt cols: 117
## skip: 0
## nrows: 207
## row_num: 0
##
Processed variant: 207
## All variants processed
YRIdna <- vcfR2DNAbin(YRI)
## After extracting indels, 192 variants remain.
##
Variant 3 processed
YRIdna
## 216 DNA sequences in binary format stored in a matrix.
##
## All sequences of same length: 192
##
## Labels:
## NA18486_0
## NA18486_1
## NA18488_0
## NA18488_1
## NA18489_0
## NA18489_1
## ...
##
## Base composition:
## a c g t
## 0.265 0.321 0.275 0.139
## (Total: 41.47 kb)
The function neutrality.stats from PopGenome conveniently calculates several different neutrality statistics for a given population. We’ll use it now to calculate Fu and Li’s D and F:
#Remember to replace the varible name with your own GENOME object!
neut <- neutrality.stats(YRIgenome)
## | : | : | 100 %
## |====================================================
get.neutrality(neut)
## neurality stats
## pop 1 Numeric,9
#To see results:
neut@Fu.Li.F
## pop 1
## [1,] 0.140358
neut@Fu.Li.D
## pop 1
## [1,] 0.1403109
As we can see here, the YRI population has D and F statistics that are pretty close to 0. This suggests neither an excess of singleton mutations in this population, nor an excess of ancestral mutations. Put simply, the UCP1 region in the Yoruban population from Ibadan hasn’t evolved much in recent history, which we might expect from an African population with little experience with extreme cold temperatures.
Before we move on, think about what the results for your population means, based on how we interpret Fu and Li’s D. Is there an excess of singletons? Is there an excess of old/ancestral mutations? What does that tell you about your population’s history? We will come back to this at the end of class.
Now we’ll look at Tajima’s D! When we ran the neutrality.stats function in PopGenome, it also calcuated our Tajima’s D statistic. We can preview our Tajima’s D results from the PopGenome output…
neut@Tajima.D
## pop 1
## [1,] 0.07245427
But, we can get more useful information by using the tajima.test function in the pegas package. This function will use the DNAbin object that we created earlier as an input. It will not only give us the same D statistic as was calculated by the neutrality.stats function, which is all well and good, but will also give us a P-value associated with our test. This is important, because it will give us a sense of how significant the results are of our neutrality test.
tajima <- tajima.test(YRIdna)
tajima
## $D
## [1] 0.07245427
##
## $Pval.normal
## [1] 0.9422404
##
## $Pval.beta
## [1] 0.8974915
What we see from YRI is, again, what we might expect from a population that has been living in a warm climate for many generations: the UCP1 locus in this population does not appear to be subject to selection. This suggests that the only evolution (remember, that’s a change in allele frequencies over generations, and does not have to be due to selection) happening in the population is selectively neutral.
Interpret your own population’s Tajima’s D value. First, compare the D value given by this function and the one given by the neutrality.stats function. Are they the same? What is the D statistic for your population, and what does it mean in terms of whether and what kind of selection might be occuring on your population? We will revisit this question at the end of the lab.